Abstract



We study the use of inverse reinforcement learning (IRL) as a tool for the recognition of
agents' behavior on the basis of observation of their sequential decision behavior interacting
with the environment. We model the problem faced by the agents as a Markov decision process
(MDP) and model the observed behavior of the agents in terms of forward planning for the MDP.
We use IRL to learn reward functions and then use these reward functions as the basis for
clustering or classification models. Experimental studies with GridWorld, a navigation problem,
and the secretary problem, an optimal stopping problem, suggest reward vectors found from IRL
can be a good basis for behavior pattern recognition problems. Empirical comparisons of our
method with several existing IRL algorithms and with direct methods that use feature statistics
observed in state-action space suggest it may be superior for behavior recognition problems.

1 Introduction

The availability of sensing technologies, such as digital cameras, global positioning systems,
infrared sensors and others, makes it easy for computers to access data that records the
interaction between agents and the environment. Web technology also provides a large amount of
background knowledge describing user behavior on the internet. Recent research has begun to
build behavior recognition systems that use such real-world data to understand users' behavior.

Many approaches have been proposed to understand agents' goals or plans from observation of
their decision behavior. Typically, the entire trace of actions is recognized and matched
against a plan library or a set of possible goals or plans. Despite the success of these
methods, they assume that the plan library, the set of possible goals, or some behavior model
is known beforehand and provided as input. In practice, however, goal information is often
completely unknown, and so it is difficult to model goals accurately.

Consider some real-world examples. Human behavior contains complex structure and relations.
For example, Kautz pointed out two basic structures for behavior: decomposition and abstraction
[9]. Behavior can be decomposed into several events. For behavior recognition, Israeli security
systems evaluate a series of events to reach a conclusion. Is the person merely loitering in
the area? Is the person wearing a warm coat on a hot day? When the number of flagged behaviors
reaches a threshold, the system recognizes a potential threat. This method depends heavily on
empirical experience and can hardly recognize the precise goal of the observed agent. To
recommend personalized advertisements, web companies hope to understand users' interests by
analyzing their web browsing history. Are users interested in cameras if they have booked a
hotel recently? There is considerable interest in categorizing users according to their
interests, while it is more difficult to infer the precise goals of the users. Because the time
and space available for advertisements are limited, effective identification of the user's
interest is of high importance for these companies. Another motivation for a new problem comes
from domains like high frequency trading of stocks and commodities, where there is considerable
interest in identifying new market players and algorithms based on observations of trading
actions, but little hope of learning the precise strategies employed by these agents [22, 14].

In this paper, we propose a new problem, termed Behavior Pattern Recognition (BPR), that
involves recognizing agents based on observation of their behavior in a sequential decision
making setting. Broadly, the problem is to classify or cluster the agents according to the
patterns of decision behavior that are learned from samples of the agents' decision-making
process. The recognition problem can be framed as a classification or clustering problem: (1)
Given observations of decision trajectories consisting of sequential actions and given a label
for each trajectory indicating which behavior pattern that agent has, the problem is to
determine the behavior pattern for an agent with unlabeled trajectories. (2) Given only
observations of the decision trajectories, the problem is to assign trajectories to clusters on
the basis of similarity of behavior patterns.

A direct solution to the BPR problem is to program heuristic rules that recognize the behavior
by decomposing complex behavior into a series of simple events and then evaluating them to
reach a conclusion. However, programming the rules is hard. In contrast to manually coded
rules, we propose a learning model for the BPR problem that characterizes the decision behavior
with high-level features in terms of the underlying goals. Consider the problem of image
recognition as an analogy. In that problem, a computer learns to categorize images by
representing every image as a multi-dimensional feature vector that consists of components
such as RGB color, texture, shape parameters or other advanced metrics. The key to
characterizing behavior is to find a high-level vector that represents the sequential behavior
and encodes the information on patterns. From the perspective of the decision-making process,
the underlying goal of an agent is considered an abstract representation of the behavior. If a
decision-making process is modeled by an MDP, the reward function is assumed to encode the goal
of that agent.

IRL [13] addresses the task of learning a reward function for a given MDP that is consistent
with observations of optimal decision making for the process. An assumption is that the
expert's goal or intention can be characterized by the reward function. If the expert is
rational, the demonstrated behavior should aim to maximize the long-term cumulative reward. We
study the use of IRL to characterize the decision behavior, modeling the problem faced by the
agents as an MDP and treating the reward function of the MDP model as a high-level abstraction
of the decision behavior. The motivation is that even when the true behavior is not rational
and we cannot learn the precise goals or decision strategies, we can still categorize the
agents by learning the reward functions that make MDP models approximate the observed behavior.

IRL has received increasing attention in the machine learning field in recent years. Most of
this work is focused on apprenticeship learning, in which IRL is used as the core method for
finding decision policies consistent with observed behavior [1, 12, 20]. A number of IRL
algorithms and modeling constructs have been proposed for apprenticeship learning or imitation
learning, including max-margin planning [16], gradient tuning methods [12], linearly solvable
MDPs [8], bootstrap learning [3], feature construction [11], Gaussian process IRL [15] and
Bayesian inference [4].

On two well-known sequential decision-making problems, we compare our method with several
existing IRL algorithms and with direct methods that use feature statistics observed in
state-action space. Our main contributions include: (1) identification of a new learning task
that categorizes agents by learning their behavior patterns; (2) design of simple methods to
solve the BPR problem that characterize behavior in the original observation space; (3)
development of a new model-based method to solve the BPR problem in MDP reward space; and (4)
observation that our new method using reward space provides a formal way to solve the behavior
recognition problem and outperforms the other methods in our experiments.



2 Preliminaries 



We define the input of the BPR problem as a tuple $B = (D_1, D_2, \ldots, D_N)$, where $D_n$,
$n \in \{1, 2, \ldots, N\}$, is the observation of the $n$-th agent. For a classification
problem, $D_n = (O_n, y_n)$, where $O_n$ is a set of observed decision trajectories and $y_n$
is the class label for the $n$-th agent. Agents that have the same behavior pattern are given
the same class label. Similarly, in a clustering problem, $D_n$ consists only of the observed
decision trajectories.

We define the set of decision trajectories $O_n = \{h_n^j\}$, $j = 1, 2, \ldots, |O_n|$, where
each trajectory $h_n^j$ is defined as a series of state and action pairs: $\{(s_t, a_t)\}$,
$t = 1, 2, \ldots, |h_n^j|$. Here, $s$ denotes the state of the decision problem and $a$
denotes the action selected by the agent at state $s$.

To determine a label for an agent, we may develop a model that decomposes the observed behavior
into several events. Each event can be described by all or part of a decision trajectory. When
the number of matching events reaches a threshold, the model assigns a label to the agent.
However, this method requires a lot of domain knowledge and human experience to program the
heuristic rules.

Another way to solve the BPR problem is to represent the problem in a multi-dimensional space
and then apply learning algorithms. The decision-making process can be characterized in two
layers. The outer layer characterizes the behavior by computing statistics of the observed
states and actions. The inner layer is an abstraction of the behavior, which is related to the
goal or the internal state of mind of the agent that fundamentally determines the behavior.

3 Simple Representation of Behavior Recognition Problem 

In this section, we describe two methods in the outer layer that categorize the decision-making
agents based on observation alone.

The first method is called feature trajectory (FT). Assume the length of a decision trajectory
is $H$. The vector that characterizes the behavior in the $j$-th decision trajectory is written
as follows.

$$f(h_n^j) = [s_1, a_1, s_2, a_2, \ldots, s_H, a_H],$$

where $s_i$, $i \in \{1, 2, \ldots, H\}$, is a discrete random variable denoting the state index
at the $i$-th decision stage, and $a_i$ represents the action selected at state $s_i$. For
example, if a problem is defined by 3 states and 2 actions, then $s_i \in \{1, 2, 3\}$ and
$a_i \in \{1, 2\}$. In the observation, every trajectory starts from the same initial state.
Given the observation set $O_n$ for the $n$-th agent, the feature vector $f_n$ is obtained by
computing $f_n = \frac{1}{|O_n|} \sum_{j=1}^{|O_n|} f(h_n^j)$, where each vector $f(h_n^j)$ is
preprocessed by scale-normalization before averaging.

Then, the $n$-th agent is represented by a feature vector $f_n$. Consider a supervised learning
problem. Given a real-valued input vector $f_n \in \mathcal{F}$ and a category label
$y_n \in \mathcal{Y}$, we aim to learn a function $h : \mathcal{F} \to \mathcal{Y}$.

The second method is called feature expectation (FE), which has been widely used in
apprenticeship learning as a representation of the averaged long-term performance. Assume a
basis function $\phi : S \to [0, 1]^d$, where $S$ denotes the state space. The feature
expectation is

$$f_n = \frac{1}{|O_n|} \sum_{j=1}^{|O_n|} \sum_{s_t \in h_n^j} \gamma^t \phi(s_t),$$

where $\gamma \in (0, 1)$ is a discount factor. The associated apprenticeship learning
algorithms aim to find a policy that performs as well as the demonstrations by minimizing the
distance between their feature expectations. Here, we only use the observed state sequence to
compute the feature expectation vector for an agent, where $\gamma$ is a manually defined
constant, e.g. 0.95. Then, the $n$-th agent can be represented by the vector $f_n$ that is
obtained from $O_n$.
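As a minimal sketch of the two observation-space representations, the Python below computes FT and FE vectors for one agent. The function names, the choice of dividing by the largest index as the scale-normalization, the one-hot basis function, and the convention that the discount exponent starts at 0 for the first state are all assumptions made for illustration; the text does not fix these details.

```python
def ft_vector(trajectories, n_states, n_actions):
    """Feature trajectory (FT): average of scaled [s_1, a_1, ..., s_H, a_H] vectors.

    Each trajectory is a list of (state, action) pairs with 1-based indices.
    Scale-normalization here divides by the largest index (an assumption).
    """
    H = len(trajectories[0])
    total = [0.0] * (2 * H)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            total[2 * t] += s / n_states
            total[2 * t + 1] += a / n_actions
    return [x / len(trajectories) for x in total]


def fe_vector(trajectories, phi, gamma=0.95):
    """Feature expectation (FE): average discounted sum of phi(s_t) over trajectories.

    The exponent starts at 0 for the first state; starting at 1 would only
    rescale the vector by a constant factor gamma.
    """
    d = len(phi(trajectories[0][0][0]))
    total = [0.0] * d
    for traj in trajectories:
        for t, (s, _a) in enumerate(traj):
            weight = gamma ** t
            for i, value in enumerate(phi(s)):
                total[i] += weight * value
    return [x / len(trajectories) for x in total]


def one_hot(n_states):
    """Hypothetical basis function: indicator vector of the state index."""
    return lambda s: [1.0 if s == i + 1 else 0.0 for i in range(n_states)]


# Two trajectories of length H = 2 in a problem with 3 states and 2 actions.
O_n = [[(1, 2), (3, 1)], [(1, 1), (2, 2)]]
f_ft = ft_vector(O_n, n_states=3, n_actions=2)
f_fe = fe_vector(O_n, one_hot(3), gamma=0.5)
```

Both vectors have a fixed length per agent (2H for FT, d for FE), so they can be passed directly to a standard clustering or classification routine.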

4 A New Representation Model 

To solve the BPR problem with high-level feature representation, we propose to use the following 
steps. 

1. Given the BPR problem with input $B$, we use the set $\{D_n\}$, $n \in \{1, 2, \ldots, N\}$,
to construct the state space $S$ and action space $A$ for the decision-making problem.

2. For the $n$-th observed agent, we assume an MDP model $M = (S, A, R_n, \gamma, \mathcal{P})$,
where $R_n$ is the unknown reward function for this agent, $\gamma$ is the constant discount
factor, and $\mathcal{P} = \{P_a\}_{a \in A}$ is a set of transition probability matrices $P_a$
for action $a \in A$. The entries of $P_a$, written as $P_a(s, s')$, give the probability of
transitioning to state $s' \in S$ from state $s \in S$ given that the action is $a$. The rows
of $P_a$, denoted $P_a(s, :)$, give a probability vector of transitioning from state $s$ to all
the states in $S$. The set $\mathcal{P}$ can be modeled using prior knowledge of the problem or
estimated from the observed decision trajectories $\{D_n\}$. In a finite state space the reward
function $R_n$ may be considered as a vector, $r_n$, whose elements give the reward in each
state. Here we expect that there exists an unknown reward function for which the MDP yields a
policy that closely matches the observed behavior.

3. Apply IRL algorithms to learn the reward vector $r_n$ for the $n$-th agent.

4. Estimate the reward vectors for every agent. Then the supervised learning problem is written
as: given the real-valued input vector $r_n \in \mathcal{R}$ and the category label
$y_n \in \mathcal{Y}$, we aim to learn a function $h : \mathcal{R} \to \mathcal{Y}$.

5. Given a newly observed agent, we repeat steps 1-3 to get the reward vector for the agent and
then predict the label for the behavior pattern using the estimated function
$h : \mathcal{R} \to \mathcal{Y}$.
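Step 2 mentions that the transition probabilities may be estimated from the observed trajectories. One simple way to do this, sketched below, is to count observed transitions and normalize each row. The function name, the 0-based indexing, and the uniform fallback for state-action pairs that were never observed are assumptions of this sketch rather than details given in the text.

```python
def estimate_transitions(trajectories, n_states, n_actions):
    """Return P[a][s][s2], a count-based estimate of P_a(s, s2) (0-based indices).

    Rows with no observations fall back to a uniform distribution (an assumption).
    """
    counts = [[[0] * n_states for _ in range(n_states)] for _ in range(n_actions)]
    for traj in trajectories:
        # Pair each (state, action) with the state observed at the next step.
        for (s, a), (s_next, _a_next) in zip(traj, traj[1:]):
            counts[a][s][s_next] += 1
    P = []
    for a in range(n_actions):
        rows = []
        for s in range(n_states):
            total = sum(counts[a][s])
            if total == 0:
                rows.append([1.0 / n_states] * n_states)
            else:
                rows.append([c / total for c in counts[a][s]])
        P.append(rows)
    return P


# Two short trajectories over 3 states and 2 actions.
trajs = [[(0, 1), (1, 0), (2, 0)], [(0, 1), (2, 1), (2, 0)]]
P = estimate_transitions(trajs, n_states=3, n_actions=2)
```

In this toy example, action 1 taken in state 0 leads once to state 1 and once to state 2, so the estimated row is [0, 0.5, 0.5].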

In the MDP model, a policy is defined as a mapping $\pi : S \to A$. The value function for a
policy $\pi$ is $V^{\pi}(s_0) = E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid p(s_0), \pi]$, where
$p(s_0)$ is the distribution of the initial state and the action at state $s_t$ is determined
by policy $\pi$. Similarly, the Q function is defined as
$Q(s, a) = R(s) + \gamma \sum_{s' \in S} P_a(s, s') V^{\pi}(s')$. At state $s$, an optimal
action is selected by $a^* = \arg\max_{a \in A} Q(s, a)$.
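For concreteness, the sketch below computes the optimal value function and the corresponding Q function by value iteration for a small finite MDP with state rewards. This is a standard forward-planning routine, not a procedure specified in the text; the function names, the tolerance, and the two-state example are assumptions.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Compute optimal V and Q for state rewards R and transitions P[a][s][s2]."""
    n_actions, n_states = len(P), len(R)
    V = [0.0] * n_states
    for _ in range(max_iter):
        # Q(s, a) = R(s) + gamma * sum_{s'} P_a(s, s') V(s')
        Q = [[R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [max(q_s) for q_s in Q]
        converged = max(abs(x - y) for x, y in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    return V, Q


def greedy_policy(Q):
    """Select a* = argmax_a Q(s, a) in every state."""
    return [max(range(len(q_s)), key=q_s.__getitem__) for q_s in Q]


# Two-state chain: action 0 stays, action 1 switches state; only state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch
R = [0.0, 1.0]
V, Q = value_iteration(P, R, gamma=0.9)
policy = greedy_policy(Q)
```

In the example the agent should switch out of state 0 and then stay in state 1, which gives values of 9 and 10 respectively.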

An instance of the IRL problem is written as a triplet $B = (M \setminus r, p(r), O)$, where
$M \setminus r$ is an MDP model without the reward function and $p(r)$ is prior knowledge on
the reward. The prior $p(r)$ can be non-informative if we have no knowledge about the reward
function, or a Gaussian or other distribution if we model the reward as a specific stochastic
process.

We use an MDP to model the decision problem faced by an agent under observation. The reality of
the agent's decision problem and process may differ from the MDP model, but we interpret every
observed decision of the agent as the choice of an action in the MDP. The dynamics of the
environment in the MDP are described by the transition probabilities $\mathcal{P}$. These
probabilities may be interpreted as a prior, if known in advance, or as an estimate of the
agent's beliefs about the dynamics. Next, we show how to learn the reward functions by
employing some existing IRL algorithms.



5 Bayesian framework for IRL 

Most existing IRL algorithms assume that the agents are perfectly rational and the observed
behavior is optimal. Prominent examples include the model in [13], which we term linear IRL
(LIRL) because of its linear nature, WMAL in [20], and PROJ in [1]. In these algorithms, the
reward function is written linearly in terms of features as
$R(s) = \sum_{i=1}^{d} \omega_i \phi_i(s) = \omega^T \phi(s)$, where $\phi : S \to [0, 1]^d$
and $\omega^T = [\omega_1, \omega_2, \ldots, \omega_d]$.

Our computational framework uses Bayesian IRL, which was initially proposed in [6], to estimate
the reward vectors in an MDP. The posterior over the reward function for the $n$-th agent is
written as

$$p(r_n \mid O_n) \propto p(O_n \mid r_n)\, p(r_n), \qquad p(O_n \mid r_n) = \prod_{j=1}^{|O_n|} \prod_{(s,a) \in h_n^j} p(a \mid s, r_n).$$

Then, the IRL problem is written as $\max_{r_n} \log p(O_n \mid r_n) + \log p(r_n)$. For many
problems, however, the computation of $p(r_n \mid O_n)$ may be complicated, and some algorithms
use Markov chain Monte Carlo (MCMC) to sample from the posterior. Because we must solve a large
number of IRL problems, we choose IRL algorithms that have a well-defined likelihood function
to reduce the computational cost.

5.1 IRL with Boltzmann Distribution 

To model the likelihood function, the IRL algorithm in [2], which we call maximum likelihood
IRL (MLIRL), uses a Boltzmann distribution to calculate $p(a \mid s, r_n)$ as
$p(a \mid s, r_n) = \frac{e^{Q(s, a)}}{\sum_{a' \in A} e^{Q(s, a')}}$.
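A minimal sketch of this likelihood, assuming the Q values have already been computed for a candidate reward vector, is given below. The function names are hypothetical, and the optional `beta` scaling parameter is an addition for illustration; with the default `beta = 1.0` it reduces to a plain softmax over Q values.

```python
import math


def boltzmann_policy(Q, beta=1.0):
    """Return p[s][a] proportional to exp(beta * Q[s][a])."""
    policy = []
    for q_s in Q:
        m = max(q_s)  # subtract the max for numerical stability
        weights = [math.exp(beta * (q - m)) for q in q_s]
        z = sum(weights)
        policy.append([w / z for w in weights])
    return policy


def log_likelihood(trajectories, Q, beta=1.0):
    """Sum of log p(a | s, r_n) over all observed (s, a) pairs (0-based indices)."""
    p = boltzmann_policy(Q, beta)
    return sum(math.log(p[s][a]) for traj in trajectories for s, a in traj)


# Q values for 2 states and 2 actions, and one observed trajectory.
Q = [[0.0, math.log(3.0)], [1.0, 1.0]]
p = boltzmann_policy(Q)
ll = log_likelihood([[(0, 1), (1, 0)]], Q)
```

Maximizing this log-likelihood plus the log-prior over the reward vector corresponds to the optimization problem stated at the start of this section.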

5.2 IRL with Gaussian Process

The IRL algorithm called GPIRL in [15] uses preference relations to model the likelihood
function $p(O_n \mid r_n)$ and assumes that $r_n$ is generated by a Gaussian process for the
$n$-th observed agent.

Given a state, we assume that the optimal action is selected according to Bellman optimality. The 
preference relation is defined as follows. 

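At state $s$, $\forall \hat{a}, \check{a} \in A$, we define the action preference relation as:

1. Action $\hat{a}$ is weakly preferred to $\check{a}$, denoted as $\hat{a} \succeq_s \check{a}$, if $Q(s, \hat{a}) \geq Q(s, \check{a})$;

2. Action $\hat{a}$ is strictly preferred to $\check{a}$, denoted as $\hat{a} \succ_s \check{a}$, if $Q(s, \hat{a}) > Q(s, \check{a})$;

3. Action $\hat{a}$ is equivalent to $\check{a}$, denoted as $\hat{a} \sim_s \check{a}$, if and only if $\hat{a} \succeq_s \check{a}$ and $\check{a} \succeq_s \hat{a}$.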

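Given the observation set $O_n$, we have a group of preference relations at each state $s$,
which is written as

$$\mathcal{E} = \{(\hat{a} \succ_s \check{a}),\ \hat{a} \in \hat{A},\ \check{a} \in A \setminus \hat{A}\} \cup \{(\hat{a} \sim_s \hat{a}'),\ \hat{a}, \hat{a}' \in \hat{A}\},$$

where $\hat{A} \subseteq A$ is the action subspace for state $s$ obtained from the set $O_n$.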

Let $r$ be the vector of $r_n$ containing the reward for $m$ possible actions at $T$ observed
states. We have

$$r = (\underbrace{r_{a_1}(s_1), \ldots, r_{a_1}(s_T)}_{r_{a_1}}, \ldots, \underbrace{r_{a_m}(s_1), \ldots, r_{a_m}(s_T)}_{r_{a_m}}),$$

where $T = |S|$ and $r_{a_m}$, $\forall m \in \{1, 2, \ldots, |A|\}$, denotes the reward with
respect to the $m$-th action.

Consider $r_{a_m}$ as a Gaussian process if, for any $\{s_1, \ldots, s_T\} \in S$, the random
variables $\{r_{a_m}(s_1), \ldots, r_{a_m}(s_T)\}$ are normally distributed. We denote by
$k_{a_m}(s_c, s_d)$ the function generating the value of entry $(c, d)$ of the covariance
matrix $K_{a_m}$, which leads to $r_{a_m} \sim N(0, K_{a_m})$. Then the joint prior probability
of the reward is a product of multivariate Gaussians, namely
$p(r \mid S) = \prod_{m=1}^{|A|} p(r_{a_m} \mid S)$ and $r \sim N(0, K)$. Note that $r$ is
completely specified by the positive definite covariance matrix $K$.

A simple strategy is to assume that the $|A|$ latent processes are uncorrelated. Then the
covariance matrix $K$ is block diagonal in the covariance matrices
$\{K_{a_1}, \ldots, K_{a_{|A|}}\}$. In practice, we use a squared exponential kernel function,
written as:

$$k_{a_m}(s_c, s_d) = e^{-\frac{1}{2}(s_c - s_d)^T M_{a_m} (s_c - s_d)} + \sigma_{a_m}^2 \delta(s_c, s_d),$$

where $M_{a_m} = \kappa_{a_m} I_T$ and $I_T$ is an identity matrix of size $T$. The function
$\delta(s_c, s_d) = 1$ when $s_c = s_d$; otherwise $\delta(s_c, s_d) = 0$. Under this definition
the covariance is almost unity between variables whose inputs are very close in Euclidean
space, and decreases as their distance increases.
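The kernel can be sketched in a few lines of Python. The sketch assumes that states are given as coordinate tuples, that the matrix in the exponent is a scalar multiple `kappa` of the identity, and that the exponent is negative so that covariance decays with distance, as the text describes. The function names and the example states are hypothetical.

```python
import math


def se_kernel(s_c, s_d, kappa, sigma):
    """exp(-0.5 * kappa * ||s_c - s_d||^2) + sigma^2 * delta(s_c, s_d)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(s_c, s_d))
    delta = 1.0 if tuple(s_c) == tuple(s_d) else 0.0
    return math.exp(-0.5 * kappa * sq_dist) + sigma ** 2 * delta


def covariance_matrix(states, kappa, sigma):
    """Covariance block for one action: entry (c, d) is k(s_c, s_d)."""
    return [[se_kernel(sc, sd, kappa, sigma) for sd in states] for sc in states]


# Three states on a 2D grid; the third is far from the first two.
states = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
K = covariance_matrix(states, kappa=1.0, sigma=0.1)
```

Under the uncorrelated-process assumption, the full covariance matrix would place one such block per action on the diagonal, each with its own pair of hyper-parameters.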

Then, the GPIRL algorithm estimates the reward function by iterating the following two main
steps:

1. Obtain the estimate $r_{MAP}$ by maximizing the posterior $p(r_n \mid O_n)$, which is
equivalent to minimizing $-\log p(O_n \mid r_n) - \log p(r_n \mid \theta)$, where
$\theta = (\kappa_{a_m}, \sigma_{a_m})$ is the hyper-parameter controlling the Gaussian process,
and $p(O_n \mid r_n) = \prod p((\hat{a} \succ_s \check{a})) \prod p((\hat{a} \sim_s \hat{a}'))$.
This optimization problem has been proved to be a convex program in [15].

2. Find the optimized hyper-parameters by applying gradient descent to maximize
$\log p(O_n \mid \theta, r_{MAP})$, which is the Laplace approximation of $p(\theta \mid O_n)$.

6 Experimentation 

Our experiments are designed to evaluate the performance of IRL algorithms in behavior recognition 
in comparison to methods that use simple feature representations obtained directly from observation 
space. We study two problems, GridWorld and the secretary problem. GridWorld provides insight 
into the task of recognizing machine agents for decision problems that may be modeled using MDP
models under the strict rationality assumption. The secretary problem provides an environment in
which the agents do not act with respect to the solution of an MDP. Agents in the secretary problem 
employ heuristic decision rules derived from experimental study of human behavior in psychology 
and economics. 

To evaluate the recognition performance, we use the following algorithms: (1) Clustering:
K-means [7]; (2) Classification: Support vector machine (SVM), K-nearest neighbors (KNN),
Fisher discriminant analysis (FDA) and logistic regression (LR) [7]. We use clustering accuracy
[21] and Normalized Mutual Information (NMI) [19] to compare clustering results.
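The two clustering metrics can be sketched as follows. The exact definitions used in [21] and [19] are not restated in the text, so this sketch makes two assumptions: NMI is normalized by the geometric mean of the two label entropies, and clustering accuracy is the fraction of agents correctly labeled under the best one-to-one mapping of clusters to classes, found here by brute force (practical only for a handful of clusters).

```python
import math
from collections import Counter
from itertools import permutations


def nmi(labels_true, labels_pred):
    """Mutual information normalized by sqrt(H(true) * H(pred)) (an assumption)."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * math.log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    if h_t == 0.0 or h_p == 0.0:
        return 0.0
    return mi / math.sqrt(h_t * h_p)


def clustering_accuracy(labels_true, labels_pred):
    """Best accuracy over one-to-one mappings from clusters to classes."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # Pad with placeholders so a one-to-one mapping always exists.
    classes = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best = max(best, hits)
    return best / len(labels_true)


y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(y_true, y_pred)
score = nmi(y_true, y_pred)
```

Both metrics are invariant to how the cluster indices are numbered, which is why they are suitable for comparing K-means output against ground truth labels.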



6.1 GridWorld problem 

In the GridWorld problem, which is used as a benchmark experiment by Ng and Russell in [13], an 
agent starts from a given square and moves towards a destination square. The agent has five actions 
to take: moving in the four cardinal directions or staying put. With probability 0.65 the agent moves
to its chosen location, with probability 0.15 it stays in the same location regardless of chosen action,
and with probability 0.2 it moves in a random cardinal direction.
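These dynamics can be written as a set of transition matrices, one per action. The sketch below is an illustration under stated assumptions: states are numbered row by row, a move that would leave the grid keeps the agent in place, and the random cardinal direction is chosen uniformly. None of these conventions is specified in the text.

```python
def gridworld_transitions(size=10, p_move=0.65, p_stay=0.15, p_random=0.2):
    """Return P[a][s][s2] for a size x size grid with 5 actions.

    Actions 0-3 are north, south, west, east; action 4 is staying put.
    Moves that would leave the grid keep the agent in place (an assumption).
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
    n = size * size

    def target(s, move):
        r, c = divmod(s, size)
        r2, c2 = r + move[0], c + move[1]
        return r2 * size + c2 if 0 <= r2 < size and 0 <= c2 < size else s

    P = [[[0.0] * n for _ in range(n)] for _ in range(len(moves))]
    for a, move in enumerate(moves):
        for s in range(n):
            P[a][s][target(s, move)] += p_move      # intended move
            P[a][s][s] += p_stay                    # stay regardless of action
            for rand_move in moves[:4]:             # uniform random cardinal move
                P[a][s][target(s, rand_move)] += p_random / 4
    return P


P = gridworld_transitions()
```

For an interior square, moving north succeeds with total probability 0.70, because the random cardinal move also goes north a quarter of the time.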

The IRL problem for GridWorld is to recover the reward structure given the observations of agent 
actions. To produce these observations, we first simulate the agent's behavior using the optimal 
solution of an MDP to decide how to move in the GridWorld. We then collect observation data by 
sampling the simulated movement. Note that the reward function of the MDP used for simulating 
the agents is not known to the IRL learner. 

We investigated the behavior recognition problem in terms of clustering and classification on a 
10 x 10 GridWorld problem. Experiments were conducted according to the steps in Algorithm 1. 



Algorithm 1 GridWorld experimentation steps

1: Input the variables $S$, $A$ and $\mathcal{P}$. Design two ground truth reward functions written as $r_1^*$ and $r_2^*$.

2: Simulate agents and sample their behavior.

3: for $i = 1 \to 2$ do

4:    for $j = 1 \to 200$ do

5:       Model an agent using $M = (S, A, \mathcal{P}, r_{ij}, \gamma)$, where the reward $r_{ij} = r_i^*$ + random Gaussian noise.

6:       Sample decision trajectories $O_{ij}$, and set the ground truth label $y_{ij} = 0$ if $i = 1$; $y_{ij} = 1$ if $i = 2$.

7:    end for

8: end for

9: IRL has access to the problem $B = (S, A, \mathcal{P}, \gamma, O_{ij})$ for each agent, and then infers the reward $\hat{r}_{ij}$.

10: Recognize these agents based on the learned $\hat{r}_{ij}$.



The simulated agents in our experiments have hybrid destinations. A small number of short decision 
trajectories tends to present challenges to action feature methods, which is an observation of particu- 
lar interest. Additionally, the length of trajectories may have a substantial impact on performance. If 
the length is so long that the observed agent reaches the destination in every trajectory, the problem 
can be easily solved based on observations. Thus, we evaluate and compare performance by making 
the length of decision trajectory small. 
Table 1: NMI scores for GridWorld problem

Figure 1: Clustering accuracy

Figure 2: Classification results with respect to different classifiers.



Table 1 displays NMI scores and Figure 1 shows clustering accuracy. The length of the
trajectory is limited to six steps, as we assume the observation is incomplete and the learner
does not have sufficient information to infer the goal directly. Results are averaged over 100
replications. Clustering performance improves with an increasing number of observations. When
the number of observations is small, the GPIRL method achieves high clustering accuracy and NMI
scores, which we attribute to its ability to find more accurate reward functions that
characterize the decision behavior well. The IRL algorithms PROJ and WMAL are not effective in
this problem because the observed decision trajectories are too short to provide a feature
expectation that is a good approximation to the agent's long-term goals. To examine whether
feature learning algorithms can improve the simple feature representations, we also ran
experiments with PCA-based features, where the projection subspace is spanned by the
eigenvectors corresponding to $c$ principal components, with $c = 10, 20, \ldots, 90$ for FE and
$c = 2, 4, 6, 8, 10$ for FT. No significant changes in the clustering NMI scores and accuracy
scores are observed. Therefore, we do not show the performance of PCA-based features in Table 1
and Figure 1.

Figure 2 displays classification accuracy for a binary classification problem in which there
are four hundred agents coming from two groups of decision strategies. The results are averaged
over 100 replications with tenfold cross-validation. Four popular classifiers (SVM, KNN, FDA
and LR) are employed to evaluate the classification performance. Results suggest that the
classifiers based on IRL perform better than the simple methods, such as FT and FE,
particularly when the number of observed trajectories and the length of the trajectory are
small. The results support our hypothesis that recovered reward functions constitute an
effective and robust feature space for clustering or classification analysis in a behavior
pattern recognition setting.

6.2 Secretary problem 

The secretary problem is a sequential decision-making problem in which the binary decision to
either stop or continue a search is made on the basis of objects already seen. As suggested by
the name, the problem is usually cast in the context of interviewing applicants for a
secretarial position. The decision maker interviews a randomly-ordered sequence of applicants
one at a time. The applicant pool is such that the interviewer can unambiguously rank each
applicant in terms of quality relative to the others seen up to that point. After each
interview, the decision maker chooses either to move on to the next applicant, forgoing any
opportunity to hire the current applicant, or to hire the current applicant, which terminates
the process. If the process goes as far as the final applicant, he or she must be hired. Thus
the decision maker chooses one and only one applicant. The objective is to maximize the
probability that the accepted applicant is, in fact, the best in the pool.

Algorithm 2 Experimentation with Secretary Problem

1: Given a heuristic rule with a parameter $h$, $k$ or $\ell$.

2: Add random Gaussian noise to the parameter, which is written as $p$.

3: Generate new secretary problems with $X$ applicants and let the $n$-th agent solve these problems using this heuristic rule with its own parameter $p$. Save the observed decision trajectories into $O_n$.

4: Model the secretary problem in terms of an MDP consisting of the following components:

   1. State space $S = \{1, 2, \ldots, X\}$, where $s \in S$ means that at time $s$ the current applicant is a candidate.

   2. Action space $A$ consisting of two actions: reject and accept.

   3. Transition probability $\mathcal{P}$, computed as follows: given the reject action, the probability of transitioning from state $s_i$ to $s_j$, $p(s_j \mid s_i)$, is $\frac{s_i}{s_j (s_j - 1)}$ if $s_j > s_i$, and 0 otherwise; given the accept action, the probability of transitioning from state $s_i$ to $s_j$, $p(s_j \mid s_i)$, is 1 if $s_i = s_j$, and 0 otherwise.

   4. The discount factor $\gamma$ is a selected constant.

   5. The reward function is unknown.

5: Infer the reward function by solving an IRL problem $B = (S, A, \mathcal{P}, \gamma, O_n)$.
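The MDP components listed in Algorithm 2 can be sketched in Python as follows. The reject probabilities use the standard result that, if the applicant at time i is a candidate, the next candidate appears at time j > i with probability i / (j (j - 1)). These probabilities sum to 1 - i/X over j, so the sketch adds an extra absorbing terminal state to hold the remaining mass i/X (the case where no later candidate appears). That terminal state and the function name are assumptions of this sketch, not part of Algorithm 2.

```python
def secretary_transitions(X):
    """Return [reject, accept] transition matrices over states 1..X plus a terminal state.

    State i is stored at index i - 1; the terminal state is stored at index X.
    The terminal state absorbs the probability that no later candidate appears.
    """
    n = X + 1
    reject = [[0.0] * n for _ in range(n)]
    accept = [[0.0] * n for _ in range(n)]
    for i in range(1, X + 1):
        for j in range(i + 1, X + 1):
            reject[i - 1][j - 1] = i / (j * (j - 1))
        reject[i - 1][X] = i / X       # no further candidate: process ends
        accept[i - 1][i - 1] = 1.0     # accepting keeps the state, as in Algorithm 2
    reject[X][X] = 1.0
    accept[X][X] = 1.0
    return [reject, accept]


P = secretary_transitions(5)
```

With five applicants, rejecting a candidate at time 1 leads to a candidate at time 2 with probability 1/2, and leaves a probability of 1/5 that the first applicant was in fact the best.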

To test our hypotheses on BPR, an ideal experiment would involve recognizing individual human
decision makers on the basis of observations of hiring decisions that they make in secretary
problem simulations. Experiments with human decision making for the secretary problem are
reported in [18, 17], but raw data consisting of decision maker action trajectories is not
available. However, a major conclusion of these studies is that the decisions made by humans
can largely be explained in terms of three decision strategies, each of which uses the concept
of a candidate. An applicant is said to be a candidate if he or she is the best applicant seen
so far. The decision strategies of interest are the:

1. Cutoff rule (CR) with cutoff value $h$, in which the agent will reject the first $h - 1$
applicants and accept the next candidate;

2. Successive non-candidate counting rule (SNCCR) with parameter value $k$, in which the
agent will accept the first candidate who follows $k$ successive non-candidate applicants
since the last candidate; and

3. Candidate counting rule (CCR) with parameter value $\ell$, in which the agent selects the
next candidate once $\ell$ candidates have been seen.
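The three rules above can be sketched as small Python functions that take a sequence of applicant quality scores and return the 1-based position of the hired applicant. The function names are hypothetical, and the sketch adopts one reading of each rule: SNCCR accepts a candidate preceded by at least k consecutive non-candidates, and CCR accepts the first candidate that appears after ell earlier candidates have been seen. In every rule the final applicant is hired if no earlier applicant is accepted.

```python
def candidate_flags(values):
    """flags[t] is True when applicant t is the best seen so far (a candidate)."""
    best = float("-inf")
    flags = []
    for v in values:
        flags.append(v > best)
        best = max(best, v)
    return flags


def cutoff_rule(values, h):
    """CR: reject the first h - 1 applicants, then accept the next candidate."""
    for t, cand in enumerate(candidate_flags(values), start=1):
        if t >= h and cand:
            return t
    return len(values)


def sncc_rule(values, k):
    """SNCCR: accept the first candidate that follows k successive non-candidates."""
    run = 0
    for t, cand in enumerate(candidate_flags(values), start=1):
        if cand:
            if run >= k:
                return t
            run = 0
        else:
            run += 1
    return len(values)


def candidate_count_rule(values, ell):
    """CCR: accept the next candidate once ell candidates have been seen."""
    seen = 0
    for t, cand in enumerate(candidate_flags(values), start=1):
        if cand:
            if seen >= ell:
                return t
            seen += 1
    return len(values)


# Candidates appear at positions 1, 3 and 6 in this sequence.
values = [3, 1, 5, 2, 4, 9, 7, 8]
choices = (cutoff_rule(values, 4), sncc_rule(values, 2), candidate_count_rule(values, 1))
```

On this sequence the cutoff rule with h = 4 and SNCCR with k = 2 both hire applicant 6, while CCR with ell = 1 hires applicant 3.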

The optimal decision strategy for the secretary problem is to use CR with a parameter that can be 
computed using dynamic programming for any value of n, the number of secretaries. As n grows, 
the optimal parameter converges to n/e and yields a probability of successfully choosing the best 
applicant that converges to 1/e. Thus only one of the three decision strategies enumerated above can 
be viewed as optimal, and that only for a single parameter value out of the continuum of possible 
values. Human actions are usually suboptimal and tend to look like mixtures of CR (with a non- 
optimal parameter), SNCCR, and CCR [18]. 
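The optimal cutoff can be computed directly for small pools. Under CR with cutoff h, the best applicant at position i >= h is selected exactly when the best of the first i - 1 applicants lies among the first h - 1, which gives a success probability of ((h - 1)/n) times the sum of 1/(i - 1) for i from h to n. The sketch below evaluates this closed form by enumeration rather than dynamic programming; the function names are hypothetical.

```python
def cutoff_success_probability(h, n):
    """Probability that CR with cutoff h selects the best of n applicants."""
    if h == 1:
        return 1.0 / n  # the first applicant is always accepted
    return (h - 1) / n * sum(1.0 / (i - 1) for i in range(h, n + 1))


def optimal_cutoff(n):
    """Cutoff h in 1..n that maximizes the success probability."""
    return max(range(1, n + 1), key=lambda h: cutoff_success_probability(h, n))


h_star = optimal_cutoff(100)
p_star = cutoff_success_probability(h_star, 100)
```

For 100 applicants this gives a cutoff of 38, that is, 37 rejected applicants, and a success probability of about 0.371, in line with the limiting values n/e and 1/e.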

As a surrogate for the action trajectories of humans, we generate action trajectories for
randomly sampled secretary problems using simulated agents that follow CR, SNCCR, and CCR. For
a given decision rule (CR, SNCCR, CCR), we simulate a group of agents that adopt this rule,
differentiating individuals in a group by adding Gaussian noise to the rule's parameter. The
details of the process are given in Algorithm 2. We use IRL and observed actions to learn
reward functions for the MDP model given in Algorithm 2. It is critical to understand that the
state space for this MDP model captures nothing of the history of candidates, and as a
consequence is wholly inadequate for the purposes of modeling SNCCR and CCR. In other words,
for general parameters, neither SNCCR nor CCR can be expressed as a policy for the MDP in
Algorithm 2. (There does exist an MDP in which all three of the decision rules can be expressed
as policies, but the state space for this model is exponentially larger.) Hence, for two of the
rules, the processes that we use to generate data and the processes we use to learn are
distinct.

Figure 3: Vectors with ground truth label projected in 2D space. The feature expectation vector
is on the left and the reward vectors recovered by IRL are on the right.

(a) Reward vectors projected in 2D space (b) Reward vectors projected in 3D space

Figure 4: Visualization of a binary classification problem for subjects using cutoff rule and
random rules. The PROJ IRL algorithm is used to recover the reward vectors.

As an initial set of experiments, we generated an equal number of agents from each rule. All
the heuristic rules use the same parameter value. We compared the method using statistical
feature representations obtained from the raw decision trajectories with our IRL model-based
method. We employ 10-fold cross-validation to obtain the average accuracy, which is always 100%.

Given that perfect classification performance was achieved by all algorithms, the problem of
recognizing across decision rules appears to be quite easy. A more challenging problem is to
recognize variations in strategy within a single decision rule. For each rule, we conducted
recognition experiments in which 300 agents were simulated, 100 each for three distinct values
of the rule parameter. Individuals were differentiated by adding random noise to the parameter.
Here, we compare the clustering performance of the simple FE method and our MDP model-based
method. In Figure 3, the left panel displays an area marked "uncertainty" for the FE method,
while the right panel shows that the reward vectors have lower variance within the same group
and higher variance between different groups. Figure 3 illustrates that when the agents'
behavior is represented in the reward space, the recognition problem becomes easier to solve.

Table 2 summarizes the NMI scores for using the K-means clustering algorithm to recognize
variations in strategy within one heuristic decision rule. The column called H in Table 2
records the number of decision trajectories that have been sampled for training. Table 2
suggests that the feature representation in reward space is almost always better than the
representation with statistical features computed from the raw observation data. Moreover, the
reward space characterizes the behavior particularly well when the amount of observation data
is small. Note that although the MDP model cannot generate a policy that is consistent with the
SNCCR and CCR rules, the reward vectors learned in the MDP environment still make the
clustering problem easier to solve.

Table 2: NMI score for Secretary Problem

Figure 4 shows a binary classification result of using the PROJ algorithm to learn the reward
functions for the agents in the secretary problem and then categorizing the agents into two
groups. In this classification experiment, each agent's ground truth label is either the cutoff
decision rule or a random strategy that makes random decisions.

7 Conclusions 

We have proposed the use of IRL to solve the problem of behavior pattern recognition. The
observed agent does not have to make decisions based on an MDP. However, we model the agent's
behavior in an MDP environment and assume that the reward function encodes the agent's
underlying decision strategies. Numerical experiments on GridWorld and the secretary problem
suggest that the advantage that IRL enjoys over action space methods is more pronounced when
observations are limited and incomplete. We also note that there seems to be a positive
correlation between the success of IRL algorithms in apprenticeship learning (cf. [15]) and
their success in the behavior recognition problem. To some degree, this relationship parallels
results from [10, 5], where apprenticeship learning benefits from a learning structure that is
based on sophisticated methods for task decomposition or hierarchical identification of skill
trees.

Validation of the ideas proposed here can come only through experimentation with more difficult 
problems. Of particular importance would be problems involving human decision makers or other 
real-world scenarios, such as periodic investment, gambling, or stock trading. 